Abstract

In this paper we propose an innovative learning algorithm, a variation of the one-class ν Support Vector Machines (SVMs) learning algorithm, that produces sparser solutions with much lower computational complexity. The proposed technique returns an approximate solution, nearly as good as the solution set obtained by the classical approach, by minimizing the original risk function along with a regularization term. We introduce a bi-criterion optimization that helps guide the search towards the optimal set in much less time. The outcome of the proposed learning technique was compared with the benchmark one-class Support Vector Machines algorithm, which more often leads to solutions with redundant support vectors. Throughout the analysis, the problem size for both optimization routines was kept consistent. We have tested the proposed algorithm on a variety of data sources under different conditions to demonstrate its effectiveness. In all cases the proposed algorithm closely preserves the accuracy of standard one-class ν SVMs while reducing both training time and test time by several factors.

Keywords: Anomaly Detection, Optimization, Sparse, Scalability, Aeronautics

1 Introduction 

Many problems in areas of interest to NASA, such as 
aviation safety and Earth science, have benefited and 
will continue to benefit from the use of data-driven 
methods for anomaly detection. For example, in avi- 
ation safety, many airlines have very large datasets rep- 
resenting the operation of their fleets of commercial air- 
craft. Most of this data represent normal operations 
of the aircraft — finding examples of anomalous opera- 
tion is comparable to the proverbial problem of finding 
a needle in a haystack. An algorithm to find anoma- 
lies in such a large dataset clearly needs to be fast and 
scalable. The algorithm also must be accurate, which 
requires leveraging as many properties of the dataset 
as possible. In particular, data from commercial air- 
craft contain continuous sequences, representing sensor 
data such as airspeed and altitude, as well as discrete 
sequences, such as sequences of pilot switch presses. An
algorithm that learns from such sequences will tend to 
outperform typical machine learning algorithms that as- 
sume that data collected at every instant in time is in- 
dependent from data collected at every other instant in 
time. 

We addressed the accuracy issue in [7], where we 
devised a Multiple Kernel Learning (MKL) version of 
one-class SVMs containing one kernel over discrete se- 
quences and one kernel over continuous sequences. We 
chose one-class SVMs as the basis of our developments 
because of its strong performance as reported by other 
researchers, its guarantee of optimality given a partic- 
ular training set, and the flexibility of kernel methods 
to utilize a variety of different types of features both 
in single kernel and multiple kernel methods. In [7], 
we demonstrated our algorithm’s effectiveness at find- 
ing anomalies within commercial aircraft data. How- 
ever, the running time of one-class SVMs is higher than 
for other algorithms that we use because of the need to 
solve an optimization problem. 

In this paper, we address the speed and scalability 
issue discussed above. We do this through a bi-criterion 
formulation of one-class SVM — that is, we add a crite- 
rion to the objective function that biases the algorithm 
toward a sparser solution, which we demonstrate theo- 
retically and experimentally. We show that our learning 
algorithm often has much lower training time than the 
classical one-class SVM learning algorithm. In some 
cases, our learning algorithm’s run time is higher, but 
we demonstrate that, in all these cases, our algorithm’s 
time to generate a classification (normal or anomalous) 
for a new data point is much lower. In spite of this, 
our algorithm’s performance is nearly the same as that 
of the classical one-class SVM in terms of how it clas- 
sifies new data. We achieved all these results without 
requiring any changes to the format of the data or any 
changes to the rest of the algorithm, such as the opti- 
mization problem solver, thereby making the algorithm 
easy to implement. 

In the following section we review prior research on speeding up and scaling Support Vector Machines, followed by our motivation and contributions. In Section 3, we describe the optimization problem of the original one-class support vector machines model, which is the underlying algorithm of our work. Subsequently, Section 4 discusses the bi-criterion optimization, which is the heart of this paper, followed by some details on the solver. Experimental evidence of the performance of the proposed technique is given in Section 5. Finally, we conclude the paper with a discussion in Section 6.

2 Background and Motivation 

Kernel based methods like one-class support vector machines have a significant disadvantage in scaling to a large number of training points. As the number of training points increases, the training time and the memory requirements increase drastically; at the same time the prediction time, which is proportional to the number of representative support vectors, also increases. The number of representative support vectors in turn grows in proportion to the number of training points. There have been several efforts to overcome training and testing time scaling issues by building an online algorithm, a parallel batch algorithm, or a sophisticated scheme to select more informative training samples. Many researchers have reported satisfactory contributions in areas such as data preprocessing, data compression, and kernel modification [12], while others have focused on optimization and solver development. Each of these tasks individually plays an important role in building the model. In [3] Burges and Schölkopf proposed the “reduced set” method to improve classification speed and the “virtual support vectors” method to improve accuracy, although at the cost of some increased training time. Some papers discuss how to improve the performance of kernel based methods in general. Most of this literature examines techniques for efficient matrix factorization, low rank approximation, etc. Schwaighofer and Tresp [17] conducted a comprehensive study on using some of these approaches to scale the Gaussian process regression technique to large data sets. Asharaf et al. address the scalability problem of SVMs using cluster based training [1, 14], where selected samples representing the cluster abstractions of the entire training data are used to build the model without compromising the generalization performance. However, the outcome of cluster based training will typically depend on the performance of the clustering algorithm. Liang Lie-quan and Liang Ying-hong [11] used a mode sensitive procedure called the “mean shift” algorithm for clustering. Another popular technique is the chunking algorithm [18], which solves a smaller QP problem formed by the samples corresponding to nonzero Lagrange multipliers. A vast number of papers discuss iterative training of support vector machines (e.g. [19, 5]). There are separate examples of ongoing research [13, 15, 8] looking for effective and efficient solvers that can handle large data sets and improve the scalability of machine learning methods that require solving optimization problems.

The scope of our current effort is intentionally 
restricted to scaling up the batch version of classical one- 
class SVMs formulation [16] without having to change 
the optimization problem solver. We assume that the 
entire data set can fit into memory but we plan to 
extend our algorithm in the future to run online or in 
parallel. Moreover, we impose the additional restriction of
not training or building the model iteratively to reach
a certain objective [6].

The work most closely related to this one is the 
“simple decomposition method” idea presented in [20]. 
The key idea in [20] is to avoid the burden of general 
linear constraints from the optimization and convert 
it to a simple bound-constrained problem. In our 
formulation we do not get rid of any linear constraints. 
Instead we take advantage of the relationship between 
the set of linear constraints and the bound information 
of the design variables. To the best of our knowledge,
none of the existing literature discusses formulating
this non-trivial regularized approximation from prior
knowledge of the constraints in the optimization problem,
which leads to sparse one-class SVMs. Our main
contributions in this paper are:

• We propose an optimization problem with an addi- 
tional meaningful criterion. The proposed formula- 
tion is acceptable and still equivalent to the classi- 
cal SVM problem in terms of generalization error. 
The proposed formulation is very simple and can 
easily be implemented. 

• We provide reasoning on why the proposed algorithm produces sparser solutions, which in turn improves the testing time by several factors.

• The proposed algorithm is several orders of magnitude faster than the existing learning method and at the same time it retains the accuracy of the benchmark algorithm. We provide theoretical explanations for this.

• We demonstrate the capability of the algorithm in handling simulated data sets with varying sparsity and real life data from the airline industry by measuring the performance of the proposed technique using different metrics, such as frequency, accuracy, sensitivity, ranking, and run time.

• We provide some useful insights regarding the effectiveness of the proposed technique based on the experimental and simulation studies.


3 Preliminaries on Single Class Support Vector 
Machines 



Figure 1: This figure illustrates the geometric interpre- 
tation of optimal hyperplane for one class Support Vec- 
tor Machines. The empty circles, solid circles and the 
dotted circles represent non-support vectors, bounded 
support vectors and unbounded support vectors respec- 
tively. 

Schölkopf [16] introduced one-class SVMs as a unique member of the SVM family. As the name suggests, the one-class SVM is an unsupervised learning method which is trained on a single class and used for estimating the density of the target support objects. In the standard one-class SVM problem, we are given a set of labeled training data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{\ell}$ in the input space, where $x_i \in \mathbb{R}^d$ and the corresponding labels $y_i \in \{+1\}$. The key idea is to construct a hyperplane that can separate outliers from the rest of the training examples, as shown in Fig. 1. In the end, we wish to develop a decision rule from the seen samples, so that when a new point comes in, we will be able to assign a class label depending on whether the model has seen this point or not. Since an $(N-1)$-dimensional hyperplane can exist in the $N$-dimensional feature space, the primary task is to find the optimal separating hyperplane that maximizes the margin between the training examples and the origin, which is the lone representative of the second class with negative label. This can be achieved by solving an optimization problem that leads to a set of training points, termed “Support Vectors” (SVs), which are the representatives of the decision boundary.

Let us define a function $\phi$ that can be used to map variables from the input space to the feature space $\mathcal{F}$, i.e. $\phi : \mathbb{R}^d \to \mathcal{F}$. In feature space the inner product $\langle \mathbf{x}_i, \mathbf{x}_j \rangle$ property holds, where $\mathbf{x}_i := \phi(x_i)$. While evaluating the dot product in the feature space, the explicit calculation using the mapped features $\phi$ can be avoided by simply evaluating the kernel function, i.e. $k(x_i, x_j) := \langle \phi(x_i), \phi(x_j) \rangle$. However in order to do so, the chosen inner-product kernel



Figure 2: In this figure we provide the illustration 
of higher dimensional mapping for linear separation 
fields. It shows that even if the patterns are nonlinearly 
separable in input space, it is possible to map them 
in higher dimensional feature space where they may be 
linearly separable. Here $\phi(\cdot)$ is the mapping function.


must satisfy Mercer’s theorem [4]. We will see an 
example of a normalized Longest Common Subsequence 
(nLCS) based kernel function later where we discuss our 
experimental studies. 

3.1 Derivation of the Optimization Problem: 

In order to construct the optimal hyperplane we solve 
the following primal problem (Eqn. 3.1). The expres- 
sion in Eqn. 3.1 simply means, “maximize the margin 
between the origin and the hyperplane (Fig. 1) for a 
nonseparable problem [16] in the feature space”. The 
primal problem is represented as 


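$$\text{minimize } P(w, \rho, \xi) = \frac{1}{2} w^T w + \frac{1}{\nu\ell} \sum_{i=1}^{\ell} \xi_i - \rho$$
$$\text{subject to } (w \cdot \phi(x_i)) \ge \rho - \xi_i, \; \xi_i \ge 0, \; \nu \in [0, 1] \qquad (3.1)$$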


where $\nu$ is a user specified parameter that defines the upper bound on the training error and also the lower bound on the fraction of training examples which are support vectors, $\xi_i$ is the non-negative slack variable, $\rho$ is the offset, $\phi(x_i)$ represents the transformed image of $x_i$ in the Euclidean space, and $i \in [\ell]$. The position of the optimal margin relative to the origin is represented by $\rho$, which in fact is the margin of separation between the positive and negative class.

Using Lagrangian and some simple manipulations, 
the constrained primal problem (Eqn. 3.1) is converted 
to a dual problem [4], 


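$$\text{minimize } Q = \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j)$$
$$\text{subject to } 0 \le \alpha_i \le \frac{1}{\nu\ell}, \; \sum_i \alpha_i = 1, \; \nu \in [0, 1] \qquad (3.2)$$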
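For readers who want the intermediate steps, the passage from the primal problem (Eqn. 3.1) to the dual problem (Eqn. 3.2) can be sketched as follows. This is the standard Lagrangian argument for the ν-formulation, written under the assumption that the box constraint of the dual is $\frac{1}{\nu\ell}$; the multiplier symbol $\beta_i$ (attached to $\xi_i \ge 0$) is introduced here only for illustration and is not part of the original notation.

```latex
\begin{aligned}
L(w, \xi, \rho, \alpha, \beta)
  &= \tfrac{1}{2} w^T w + \frac{1}{\nu\ell}\sum_{i=1}^{\ell} \xi_i - \rho
     - \sum_{i=1}^{\ell} \alpha_i \big( (w \cdot \phi(x_i)) - \rho + \xi_i \big)
     - \sum_{i=1}^{\ell} \beta_i \xi_i,
     \qquad \alpha_i, \beta_i \ge 0, \\
\frac{\partial L}{\partial w} = 0
  &\;\Rightarrow\; w = \sum_{i} \alpha_i \phi(x_i), \\
\frac{\partial L}{\partial \xi_i} = 0
  &\;\Rightarrow\; \alpha_i = \frac{1}{\nu\ell} - \beta_i \le \frac{1}{\nu\ell}, \\
\frac{\partial L}{\partial \rho} = 0
  &\;\Rightarrow\; \sum_{i} \alpha_i = 1, \\
L \big|_{\text{stationary}}
  &= -\tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j).
\end{aligned}
```

Maximizing the last expression over $\alpha$ is the same as minimizing $Q$, which gives the objective and the constraints of the dual problem.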


It is not difficult to show that $\rho = \sum_i \alpha_i k(x_i, x_j)$ for the solution $w$ and any pattern $x_j$ corresponding to $0 < \alpha_j < \frac{1}{\nu\ell}$, for which $\xi_j = 0$.

The weights assigned to the training points are the Lagrangian multipliers ($\alpha$), which range between 0 and the upper bound $\frac{1}{\nu\ell}$ of Eqn. 3.2. There exist at least $\nu\ell$ non-zero Lagrangian multipliers. Support Vectors (SVs) are training points $\{x_i : i \in [\ell], \alpha_i > 0\}$ with non-zero weights. Non-margin or bounded SVs are the ones with $\{x_i : i \in [\ell], \alpha_i = \frac{1}{\nu\ell}\}$ and margin or unbounded SVs are those with $\{x_i : i \in [\ell], 0 < \alpha_i < \frac{1}{\nu\ell}\}$.
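The three groups of training points described above can be separated directly from a solved weight vector. The following is a minimal sketch, assuming the box constraint of the dual is $\frac{1}{\nu\ell}$ and that a small tolerance is acceptable when comparing floating-point weights to the bounds; the function name and the toy weights are hypothetical.

```python
def partition_by_weight(alpha, nu, tol=1e-12):
    """Split training indices into non-SVs, margin SVs and bounded SVs."""
    ell = len(alpha)
    upper = 1.0 / (nu * ell)  # assumed box constraint of the dual problem
    non_sv, margin_sv, bounded_sv = [], [], []
    for i, a in enumerate(alpha):
        if a <= tol:
            non_sv.append(i)        # zero weight: not a support vector
        elif a >= upper - tol:
            bounded_sv.append(i)    # weight at the upper bound
        else:
            margin_sv.append(i)     # weight strictly inside the box
    return non_sv, margin_sv, bounded_sv


# Toy weights: ell = 8 and nu = 0.25, so the upper bound is 1 / (0.25 * 8) = 0.5.
alpha = [0.0, 0.25, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0]
non_sv, margin_sv, bounded_sv = partition_by_weight(alpha, nu=0.25)
```

The weights in the toy example sum to one, so they satisfy the equality constraint of the dual as well as the bounds.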

Once $\alpha$ is known, SVMs compute the decision function,

$$f(x_j) = \operatorname{sign}\Big( \sum_{i \in I_m} \alpha_i k(x_i, x_j) + \frac{1}{\nu\ell} \sum_{i \in I_{nm}} k(x_i, x_j) - \rho \Big) \qquad (3.3)$$

where $I_0 = \{i : \alpha_i = 0\}$, $I_m = \{i : 0 < \alpha_i < \frac{1}{\nu\ell}\}$ and $I_{nm} = \{i : \alpha_i = \frac{1}{\nu\ell}\}$ are the sets of indices of Lagrangian multipliers corresponding to non-SVs, marginal and non-marginal support vectors respectively. The pseudo-code of the one-class SVMs algorithm is shown in Algorithm 1. Given a test point $x_j$, if $f(x_j) < 0$, then $x_j$ is predicted to be an outlier, whereas if $f(x_j) > 0$, then $x_j$ is predicted to be normal.

Algorithm 1 Single Class SVMs Algorithm
1: Input Vector: $X = \{x_1, x_2, \ldots, x_m, z\}$, $X \in \mathbb{R}^d$.
2: Map Features: $K(\phi(x_i), \phi(x_j))$.
3: Solve Eqn. 3.2 to obtain $\alpha$ corresponding to Support Vectors (SVs).
4: Calculate bias, $\rho = \sum_k \alpha_k K(\phi(x), \phi(x_k))$.
5: Calculate score, $f(z) = \sum_{k=1}^{m} \alpha_k K(\phi(x_k), \phi(z))$.
6: if $f(z) > \rho$ then
7:   return 1
8: else
9:   return 0
10: end if

4 The Multi-criterion Optimization

The multi-criterion optimization problem has several fascinating applications in fields that comprise Economics, Engineering, Mathematics etc. Given a set of criteria $q(x) = \sum_i \lambda_i f_i(x)$ and a set of feasible points $\Omega \subseteq \mathbb{R}^n$, the key idea is to find the optimal point $x \in \Omega$ for which $q(x) \le q(z)$, $\forall z$ from the feasible set. This can be expressed as,

$$\min_{x \in \mathbb{R}^n} q(x)$$
$$\text{subject to } c_i = 0, \; i \in \mathcal{E}; \quad c_i \ge 0, \; i \in \mathcal{I} \qquad (4.4)$$

where $c_i = 0$, $i \in \mathcal{E}$ are equality constraints and $c_i \ge 0$, $i \in \mathcal{I}$ are inequality constraints. There are methods [10] that also find multiple solutions that cover the full set of possible trade-offs between the various objective functions. The selection of these criteria is typically based on the knowledge of optimal design or control variables, summary statistics, model assumptions, target objectives like smoothing, de-noising etc. A detailed description of techniques that take care of the trade-off between multiple criteria can be found in [2].


4.0.1 Bi-criterion Formulation: The Main Idea 

To make the dual formulation more effective, we take into account the structure of the linear constraints and their dependencies on the variable bounds. We make this approximation by incorporating a second-order penalty function, keeping in mind the description of support vectors and the properties of the associated Lagrangian multipliers/weights. The bi-criterion formulation of one-class SVM takes the form,

$$\min_{\alpha \in \mathbb{R}^n} Q = \frac{1}{2} \alpha^T K \alpha - \lambda \Big( \frac{1}{2\nu\ell}\mathbf{1} - \alpha \Big)^T \Big( \frac{1}{2\nu\ell}\mathbf{1} - \alpha \Big)$$
$$\text{subject to } 0 \le \alpha \le \frac{1}{\nu\ell}\mathbf{1}, \; \mathbf{1}^T \alpha = 1, \; \nu \in [0, 1] \qquad (4.5)$$

where $\alpha$ is the vector of Lagrangian multipliers, $K$ is the similarity matrix and $\lambda$ is the control parameter that weights the penalty term. The motivation behind the additional penalty term is that the bi-criterion formulation seeks the values of the design variables closest to the extreme (upper or lower) bounds of the design variable while simultaneously minimizing the first term. Only training points with non-zero weights are considered as support vectors. Intuitively, the equality constraint is satisfied with the least number of design variables only when the weights corresponding to those variables tend to be close to the maximum possible value (i.e. $\alpha_i = \frac{1}{\nu\ell}$). Hence by solving the above problem we expect to obtain a sparse solution. In the following sections we will see that the quadratic penalty function is compatible with the method of direction search and plays a significant role in reaching the optimal solution with fewer computations.
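The effect of the penalty term on sparsity can be illustrated numerically. The sketch below is not the authors' implementation: it assumes the penalty is the squared distance of $\alpha$ from the midpoint $\frac{1}{2\nu\ell}$ of the box, subtracted with weight `lam`, which is one reading consistent with the rewritten problem in Section 4.1. The kernel matrix and the two candidate weight vectors are toy values chosen only for illustration.

```python
def quad_form(K, a):
    """Return a^T K a for a square matrix K given as a list of rows."""
    n = len(a)
    return sum(a[i] * K[i][j] * a[j] for i in range(n) for j in range(n))


def bicriterion_objective(K, alpha, nu, lam):
    """Classical dual objective minus lam times the assumed penalty term."""
    ell = len(alpha)
    mid = 1.0 / (2.0 * nu * ell)  # midpoint of the box [0, 1/(nu*ell)]
    penalty = sum((mid - a) ** 2 for a in alpha)
    return 0.5 * quad_form(K, alpha) - lam * penalty


# Toy problem: ell = 4, nu = 0.5, so the upper bound on each weight is 0.5.
K = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
dense = [0.25, 0.25, 0.25, 0.25]   # all weights strictly inside the box
sparse = [0.5, 0.5, 0.0, 0.0]      # all weights on a bound

q_dense_classical = bicriterion_objective(K, dense, nu=0.5, lam=0.0)
q_sparse_classical = bicriterion_objective(K, sparse, nu=0.5, lam=0.0)
q_dense_penalized = bicriterion_objective(K, dense, nu=0.5, lam=1.0)
q_sparse_penalized = bicriterion_objective(K, sparse, nu=0.5, lam=1.0)
```

Both candidates are feasible. With `lam = 0` the dense vector has the lower objective, while with `lam = 1` the sparse vector, whose weights all sit on a bound, is preferred.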


Proposition 4.1. The bi-criterion formulation (Eqn. 4.5) of SVMs is convex.

Proof. Solving this optimization problem means that we need to minimize two convex criteria on a defined set:

• The Hessian of the objective function $Q$ (in Eqn. 3.2) of the classical one-class SVMs problem is given by $\nabla^2 Q(\alpha) = K$, where $K \in \mathbb{S}^n$ is a symmetric kernel matrix. Since we make sure that the defined kernel matrix is positive definite or positive semi-definite, the objective function is either strictly convex or convex.

• Since the controlled criterion takes the form of a squared Euclidean norm $h = \big(\frac{1}{2\nu\ell}\mathbf{1} - \alpha\big)^T \big(\frac{1}{2\nu\ell}\mathbf{1} - \alpha\big)$, $h$ is strictly convex.


• Given $0 \le \alpha_i \le \frac{1}{\nu\ell}$, the constraint in Eqn. 3.2 defines a convex set, as $\sum_i \alpha_i$ is convex.

Here we will briefly discuss the nature of the solutions that the bi-criterion formulation may yield. With the control parameter $\lambda = 0$ (Eqn. 4.5), we would get the classical solution. However with a non-zero control parameter (say $\lambda = 1$), the quadratic term leads to sparser solutions. Suppose we are given $\ell$ training samples and model parameter $\nu \in [0, 1]$, and define $p = \nu\ell$. The upper bound of the constraint (Eqn. 3.2) is $\frac{1}{p}$. The second order term of the objective function attains its maxima at 0 and $\frac{1}{p}$ and therefore the solution will tend to push the $\alpha$'s toward the extreme values in the range. Since $\sum_i \alpha_i = 1$ and $\alpha_i$ can attain a maximum value of $\frac{1}{p}$, we can, without loss of generality, decompose the previous expression as

$$\sum_{i=1}^{\ell} \alpha_i = \sum_{i=1}^{p} \alpha_i + \sum_{i=p+1}^{\ell} \alpha_i = p \cdot \frac{1}{p} + 0 = 1.$$

Hence the solution is a set of $p$ training inputs with maximum weights, i.e. $\alpha_1 = \alpha_2 = \alpha_3 = \cdots = \alpha_p = \frac{1}{p}$. If $p$ is not an integer, it is rounded to the nearest integer value (say $\bar{p}$) and the above process is repeated. This results in $\bar{p} - 1$ design variables attaining the upper bound, thus forcing the remaining ones to take any values from the range defined by $0 \le \alpha_i \le \frac{1}{\nu\ell}$ such that $\sum_i \alpha_i = 1$ is satisfied. Therefore with a $\lambda$ which is large enough, the optimization is pushed toward a solution that is more sparse than the classical solution.

4.1 Active Set: The Quadratic Solver

The “Active set” algorithm [9, 15] is very popular in solving QP problems with constraints, especially when the positive semi-definite matrix $K$ is dense in nature. After dropping the constant term, Equation 4.5 can be rewritten as,

$$\min_{\alpha \in \mathbb{R}^n} Q = \frac{1}{2} \alpha^T \tilde{K} \alpha + C^T \alpha$$
$$\text{subject to } 0 \le \alpha \le \frac{1}{\nu\ell} e, \; A\alpha = b, \; \nu \in [0, 1] \qquad (4.6)$$

where $A = e^T$, $e = \mathbf{1}$, $b = 1$, $\tilde{K} = K - 2\lambda I$ and $C = \frac{\lambda}{\nu\ell} e$. The optimization problem defined above is a quadratic programming problem with a linear set of constraints, and we would like to solve this problem in a finite number of steps using the “Active set” algorithm. In the active set algorithm, the first step is to compute a feasible start point which satisfies both the bounds and the equality constraints. Given a feasible start point $\alpha_0$, the task is to iteratively minimize the objective function. However this requires us to find a suitable direction of search and a non-negative step size.


Definition At any a, the active set A(a) consists of 
free variable indices from the equality constraints to- 
gether with the indices of variables which are temporar- 
ily fixed on their upper/lower bounds. 
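A feasible start point and the bound-fixed part of the active set are easy to compute for this problem. The sketch below is a hedged illustration rather than the solver used in the paper: it assumes the box constraint $0 \le \alpha_i \le \frac{1}{\nu\ell}$ and the single equality constraint $\sum_i \alpha_i = 1$, and it uses the uniform vector as one possible start point. The function names are hypothetical.

```python
def uniform_start(ell, nu):
    """Return alpha_i = 1/ell, which is feasible whenever 0 < nu <= 1."""
    if not 0.0 < nu <= 1.0:
        raise ValueError("nu must lie in (0, 1]")
    # 1/ell <= 1/(nu*ell) because nu <= 1, and the entries sum to one.
    return [1.0 / ell] * ell


def variables_at_bounds(alpha, nu, tol=1e-12):
    """Indices fixed at the lower bound, at the upper bound, and free."""
    upper = 1.0 / (nu * len(alpha))
    at_lower = [i for i, a in enumerate(alpha) if a <= tol]
    at_upper = [i for i, a in enumerate(alpha) if a >= upper - tol]
    free = [i for i, a in enumerate(alpha) if tol < a < upper - tol]
    return at_lower, at_upper, free


alpha0 = uniform_start(ell=8, nu=0.25)
start_sets = variables_at_bounds(alpha0, nu=0.25)
later_sets = variables_at_bounds([0.5, 0.5, 0.0, 0.0], nu=0.5)
```

At the uniform start every variable is free; in the second call all four variables sit on a bound, which is the situation the bi-criterion formulation tends to produce.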


4.1.1 Reduction of Problem Size

At any $k$-th iteration, suppose we have some $\alpha_k$. We would like to create a partitioning of the active set. If “X” refers to entities corresponding to design variables whose values are temporarily fixed, and the complement set of variables, termed free variables, is denoted by “R”, we can create a partitioning of the current point, i.e. $\alpha_k = [\alpha^R \; \alpha^X]$ and $n = [n^X \; n^R]$, where $n$ is the cardinality of the design variable. Similarly we can also define the partitions $A = [A^R \; A^X]$ and $C = [C^R \; C^X]$. We can also define,

$$\tilde{K} = \begin{bmatrix} \tilde{K}^{R,R} & \tilde{K}^{R,X} \\ \tilde{K}^{X,R} & \tilde{K}^{X,X} \end{bmatrix}$$

where $\tilde{K}^{X,R} = (\tilde{K}^{R,X})^T$. At iteration $k$, we can define a working set $W_k$ which is constructed by $t$ equality constraints only. Temporarily discard all the fixed variables so that we end up with $n = n^R$ and $t = m_e$, where $m_e$ denotes the total number of equality constraints. The direction of search is computed by solving the following reduced problem,

$$\min_{\alpha^R \in \mathbb{R}^{n^R}} Q = \frac{1}{2} {\alpha_k^R}^T \tilde{K}^{R,R} \alpha_k^R + {C^R}^T \alpha_k^R$$
$$\text{subject to } 0 \le \alpha^R \le \frac{1}{\nu\ell} e, \; A^R \alpha^R = b - A^X \alpha^X, \; \nu \in [0, 1] \qquad (4.7)$$

Once the reduced problem is formed, the next task is to check if $Q(\alpha_k^R)$ is minimized for the given $\alpha_k^R$ and $W_k$. If $Q(\alpha_k^R)$ is not minimized, we need to compute the direction and the step size such that $Q(\alpha_{k+1}^R) < Q(\alpha_k^R)$. The pseudo code of the algorithm to compute the direction and the step size is shown in Algorithm 2. The bi-criterion formulation tends to push the $\alpha$'s toward the extreme values in the range. However the optimization prefers $\alpha_i$ to attain the maximum value of $\frac{1}{\nu\ell}$ to maintain a finite step size in the suitable direction. As a consequence, the number of bounded variables quickly increases, thus resulting in a much smaller problem (Eqn. 4.7) to solve. The reduced QP problem (step 3, Algorithm 2) can be solved using elimination of variables or Lagrangian methods.


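Algorithm 2 Sub-problem of active set algorithm
1: Input: $\alpha_k^R$, $\tilde{K}^{R,R}$, $C^R$. Let the direction be denoted by $d_k^R = \alpha_{k+1}^R - \alpha_k^R$ and $g_k = \tilde{K}^{R,R} \alpha_k^R + C^R$.
2: $Q(\alpha_{k+1}^R) = Q(\alpha_k^R + d_k^R) = \frac{1}{2} (\alpha_k^R + d_k^R)^T \tilde{K}^{R,R} (\alpha_k^R + d_k^R) + {C^R}^T (\alpha_k^R + d_k^R) = Q(\alpha_k^R) + \frac{1}{2} {d_k^R}^T \tilde{K}^{R,R} d_k^R + g_k^T d_k^R$.
3: Modified sub-problem: $\min_{d} \; \frac{1}{2} {d_k^R}^T \tilde{K}^{R,R} d_k^R + g_k^T d_k^R$ subject to $A^R d_k^R = 0$
4: if $d_k^R \ne 0$ then
5:   Calculate the step size along the direction $d_k^R$
6:   if $\alpha_k^R + d_k^R$ is feasible then
7:     set $\alpha_{k+1}^R = d_k^R + \alpha_k^R$
8:   else
9:     set $\alpha_{k+1}^R = \gamma_k d_k^R + \alpha_k^R$, step size $\gamma_k \in [0, 1]$
10:  end if
11: else
12:  Check for the KKT condition
     $$\begin{bmatrix} \tilde{K}^{R,R} & {A^R}^T \\ A^R & 0 \end{bmatrix} \begin{bmatrix} d_k^R \\ \mu_k \end{bmatrix} = - \begin{bmatrix} \tilde{K}^{R,R} \alpha_k^R + C^R \\ 0 \end{bmatrix}$$
     where $\mu_k$ denotes the multipliers of the equality constraints.
13: end if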
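The step-size computation in Algorithm 2 can be sketched with a standard ratio test: take the full step if it stays inside the box, otherwise shrink it until the first variable reaches a bound. This is a minimal illustration under the assumption that only the bound constraints limit the step (the equality constraint is preserved because the direction sums to zero); the function name is hypothetical.

```python
def max_feasible_step(alpha, d, upper, lower=0.0):
    """Largest gamma in [0, 1] such that alpha + gamma * d stays in the box."""
    gamma = 1.0
    for a, di in zip(alpha, d):
        if di > 0.0:
            gamma = min(gamma, (upper - a) / di)   # would hit the upper bound
        elif di < 0.0:
            gamma = min(gamma, (lower - a) / di)   # would hit the lower bound
    return max(gamma, 0.0)


# Toy iterate with ell = 4 and nu = 0.5, so the upper bound is 0.5.
alpha_k = [0.25, 0.25, 0.25, 0.25]
d_k = [0.5, -0.5, 0.0, 0.0]          # sums to zero, so sum(alpha) is preserved
gamma_k = max_feasible_step(alpha_k, d_k, upper=0.5)
alpha_next = [a + gamma_k * di for a, di in zip(alpha_k, d_k)]
```

In this toy case the full step would leave the box, so the step is halved and two variables land exactly on their bounds, which shrinks the set of free variables in the next iteration.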


5 Experiments and Discussions 

In this section we conduct computational experiments with bi-criterion SVMs and present some studies comparing bi-criterion and classical SVMs. In our analysis, we considered two very different data sets as benchmark applications: one real-world FOQA (Flight Operations Quality Assurance) data set and one simulated data set. The aviation data is representative of one of the most complex engineering systems, with very large size and dimensionality. Such a domain also poses a real challenge in identifying anomalies in high-dimensional, multivariate data sets containing discrete, categorical, and continuous features. Therefore it is an ideal platform to test the accuracy and scalability of anomaly detection algorithms. The simulation based study was designed as a proof-of-concept analysis that demonstrates the performance and effectiveness of the proposed bi-criterion algorithm under different test conditions. Both the bi-criterion and classical one-class SVMs algorithms were tested on a Linux cluster comprising 16 slave nodes, each of which is a dual processor 1-U server containing two quad-core Intel Xeon processors @ 2.66 GHz, totaling 128 cores and 128 GB RAM (1 GB/core). It is controlled by two master nodes and has 30 TB of storage. Under each test condition, the design variables of the optimization for the bi-criterion and classical one-class SVMs were initialized with the same random set to preserve consistency.


5.1 Airlines Data: A Realistic Scenario

The real world data set chosen for analysis is from a commercial airline. The data is obtained from medium range narrow body passenger aircraft. In our current analysis we considered a total of 2048 flights, a small subset of which landed at the same airport. Each flight consists of 365 parameters acquired at 1 Hz. Our working data set consists of the descent portions of the flights from 10,000 ft to touch-down (average flight length of 10A samples) and has 104 discrete and 45 continuous parameters, which were selected based on domain experts' feedback. For continuous data, each parameter in the training and testing data is z-score normalized using the statistics of that parameter calculated across all training flights. The continuous and discrete data are converted to continuous and discrete sequences respectively. Once the sequences are generated, the continuous and discrete kernels are separately computed pairwise across all possible flight combinations in the training set. For pairwise comparison we used a longest common subsequence based similarity function (Eqn. 5.8).


$$K(x_i, x_j) = \frac{|LCS(x_i, x_j)|}{\sqrt{l_{x_i} l_{x_j}}} \qquad (5.8)$$

where $l_x$ is the number of symbols in sequence $x$. Given two sequences $x_i$ and $x_j$, their common subsequences are identified. The longest such subsequence of $x_i$ and $x_j$ is called the longest common subsequence (LCS) and is denoted by $LCS(x_i, x_j)$, and $|LCS(x_i, x_j)|$ is its length.
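The similarity function in Eqn. 5.8 can be sketched with the textbook dynamic program for the LCS length. This is an illustrative implementation, not the code used in the experiments; sequences are assumed to be Python strings or lists of symbols, and returning zero for an empty sequence is a convention chosen here.

```python
from math import sqrt


def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)          # LCS lengths for the previous row
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def nlcs_kernel(a, b):
    """Normalized LCS similarity: |LCS(a, b)| / sqrt(len(a) * len(b))."""
    if not a or not b:
        return 0.0                      # convention for empty sequences
    return lcs_length(a, b) / sqrt(len(a) * len(b))


k_same = nlcs_kernel("ABCD", "ABCD")
k_disjoint = nlcs_kernel("AB", "CD")
k_mixed = nlcs_kernel("ABCBDAB", "BDCABA")
```

Identical sequences score one, sequences with no common symbol score zero, and everything else falls in between. The dynamic program runs in time proportional to the product of the two sequence lengths.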

Once the kernels are generated, we combine them in a convex fashion. Algorithm 3 shows the operations to generate the kernel. For details see the original paper [7], where we demonstrated the Multiple Kernel Anomaly Detection (MKAD) algorithm that can detect if the discrete pilot inputs combined with the observation vector are nominal or off nominal.


Algorithm 3 Pre-processing steps to generate a kernel
1: Continuous Input: $C = \{x_{1c}, x_{2c}, \ldots, x_{mc}, z_c\}$, $C \in \mathbb{R}^d$; Discrete Sequence Input: $S = \{x_{1s}, x_{2s}, \ldots, x_{ms}, z_s\}$, $S \in \mathbb{R}$.
2: Generate Continuous Sequence: $\{x_{1q}, x_{2q}, \ldots, x_{mq}, z_q\} = \mathrm{SAX}\{x_{1c}, x_{2c}, \ldots, x_{mc}, z_c\}$ [?].
3: Generate Continuous and Discrete Features: $\{\phi(x_{1q}), \phi(x_{2q}), \ldots, \phi(x_{mq}), \phi(z_q)\}$ and $\{\phi(x_{1s}), \phi(x_{2s}), \ldots, \phi(x_{ms}), \phi(z_s)\}$.
4: Combine kernel: $\beta_q K_q(\phi(x_{iq}), \phi(x_{jq})) + \beta_s K_s(\phi(x_{is}), \phi(x_{js}))$.
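The last step of Algorithm 3, the convex combination of the continuous and discrete kernels, can be sketched as follows. The sketch assumes that "convex" means non-negative weights that sum to one, so the discrete weight is taken as one minus the continuous weight; the function name and the toy matrices are hypothetical.

```python
def combine_kernels(K_q, K_s, beta_q):
    """Return beta_q * K_q + (1 - beta_q) * K_s for two equally sized matrices."""
    if not 0.0 <= beta_q <= 1.0:
        raise ValueError("beta_q must lie in [0, 1] for a convex combination")
    beta_s = 1.0 - beta_q  # assumed: the two weights sum to one
    return [
        [beta_q * kq + beta_s * ks for kq, ks in zip(row_q, row_s)]
        for row_q, row_s in zip(K_q, K_s)
    ]


# Toy kernels over two flights: continuous (K_q) and discrete (K_s).
K_q = [[1.0, 0.5], [0.5, 1.0]]
K_s = [[1.0, 0.0], [0.0, 1.0]]
K_combined = combine_kernels(K_q, K_s, beta_q=0.5)
```

A convex combination of two positive semi-definite kernel matrices is again positive semi-definite, so the combined matrix can be passed to the same solver.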


The active-set algorithm has been used to solve the quadratic problem. Throughout this experiment, the user defined inputs, for example the kernel matrix, initialization vector, ν parameter, stopping criteria, etc., were kept consistent for both algorithms. From run to run, the design variables were randomly initialized with values between 0 and 1. However for any particular run both algorithms started from the same initial point. In the first set of experiments, both models were built with training sizes varying from 200 samples up to 2000 sample points with ν = 0.05, and the number of support vectors was recorded for each case. These results are unique and reproducible for the given data and parameter settings. Figure 3 shows that bi-criterion SVMs always produce fewer support vectors than the classical approach for different training sizes, and the reduced set size is typically the lower bound on the number of total support vectors, i.e. ν times the number of training points.

Figure 4 compares the distribution of the weights ($\alpha_i$ for $i \in [\ell]$) corresponding to the support vectors for a case where we used 2000 sample points for training and set ν = 0.05. For classical SVMs there are more instances where weights are scattered in between the bounds. However for the bi-criterion formulation we see that all weights lie on the upper bound. Fig. 3 and Fig. 4 complement each other and show that our algorithm produces fewer support vectors by forcing weights toward the upper and lower bounds. In summary, with the majority of the Lagrangian multipliers/weights on the upper bound, the model results in a much reduced set of non-zero weights.

Here we extend our observation from Fig. 3. We
have seen that bi-criterion SVMs result in a sparser
solution when compared to the classical model. The analysis
(Fig. 4) showed that the classical solution consists of 331


Figure 3: Comparison of the number of support vectors obtained from the bi-criterion and classical SVMs techniques for different training sizes over a single run. For each and every run, the bi-criterion formulation converges to a sparser solution and thus outperforms the classical SVMs formulation.

Figure 4: In this figure we compare the distribution of the weights of support vectors obtained from the classical and bi-criterion SVMs. We can observe that most design variables in the bi-criterion formulation correspond to the “upper bound” (see Eqn. 3.2). For classical SVMs there are some instances where the design variables hold values between the upper and lower bounds.

non-zero weights while bi-criterion SVMs produced 308, which is exactly ν × 2048, where 2048 is the number of training samples. An initial investigation found that the solutions from both methods have a total of 304 support vectors in common, which jointly account for approximately 97-98% of the total weight (see the linear constraint in Eqn. 3.2), which is unity. To account for the remaining weight, bi-criterion SVMs propose 4 unique SVs while the classical approach assigns 27 SVs, which may well be a source of redundancy. These results are summarized in Table 1.


Table 1: Comparison of classical and bi-criterion SVMs to check the presence and the influence of redundancy. The analysis showed that 304 indices (of Support Vectors) are common to both solutions and that they jointly account for approximately 97-98% of the total weight. To account for the remaining weight, classical SVMs use approximately 7 times the number of unique SVs used by bi-criterion SVMs.


Algorithms           Overlapping index (influence)   Unique index (influence)
MKAD                 304 (97.17%)                    27 (2.83%)
Bi-criterion MKAD    304 (98.7%)                     4 (1.3%)

[Figure 5 box plot; x-axis categories: Classical one-class SVMs, Bi-criterion one-class SVMs]


Figure 5: Run time analysis for classical and bi-criterion SVMs under different initialization conditions. This experiment was repeated 100 times with random initializations and the observed running times were recorded. It is clear that the bi-criterion formulation performs much better than classical SVMs.


In Fig. 5 we show the resulting training time (in hours) for the exact solution and the bi-criterion formulation with 2000 sample points as training points and ν = 0.15. In the box plot, we show the mean training time over 100 runs and the corresponding error bars. The mean run times are 21.64 hours and 2.43 hours for classical and bi-criterion SVMs respectively. The standard deviations are 2.4 and 0.23 hours for the respective models. The proposed formulation is therefore on average about 9 times faster (21.64 / 2.43 ≈ 8.9) than the classical one-class SVM model for the given model parameter settings. This performance gain factor is expected to increase with increasing training set size. In a separate experiment, we repeated the same case with random initializations drawn from the feasible region defined by the bound constraints. This led to a further gain in run time, because the optimization routine does not spend any time looking for an initialization set in the feasible region. However this observation is true for both algorithms.




[Figure 6 plot; legend: Classical one-class SVMs,
Bi-criterion one-class SVMs; x-axis: sorted index of the
top 50 anomalous training observations]


Figure 6: Normalized scores of the top 50 abnormal
entries detected in the FOQA training set data. Both
sets of scores are arranged in descending order of
the classical algorithm’s score. This experiment was
repeated 100 times with random initializations. The
figure shows that the bi-criterion algorithm almost
always orders data points the same way as the classical
algorithm.

In this section we present some results on prediction
performance. In this analysis, we asked both models to
predict the top 50 outliers from the training pool of 2048
and compared their associated outlier scores and rankings.
We sorted the outlier scores and then normalized them
to 1. In Fig. 6 we compare the mean scores, with
associated error bars from multiple runs, in log scale.
This experiment was repeated 100 times with random
initializations. Figure 6 clearly shows that bi-criterion
SVMs predict and rank the points consistently in terms of
their outlierness, and the outcome is very comparable to
that of classical one-class SVMs.
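
A minimal sketch of this comparison is given below. It rests on two
assumptions that the text does not spell out: that a larger score
means a more anomalous point, and that "normalized to 1" means
dividing by the largest score in the top-k set. All function names
and the toy scores are hypothetical.

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores, most anomalous first
    (assumes larger score = more anomalous)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def normalize_to_unit_max(values):
    """Scale positive scores so that the largest equals 1 (assumed
    meaning of 'normalized to 1')."""
    peak = max(values)
    return [v / peak for v in values]


def compare_top_k(classical_scores, bicriterion_scores, k=50):
    """Order both score sets by the classical ranking and report the
    fraction of top-k points that the two models share."""
    order = top_k_indices(classical_scores, k)
    classical_top = normalize_to_unit_max([classical_scores[i] for i in order])
    bicriterion_top = normalize_to_unit_max([bicriterion_scores[i] for i in order])
    shared = set(order) & set(top_k_indices(bicriterion_scores, k))
    return order, classical_top, bicriterion_top, len(shared) / k


# Toy scores for four training points (hypothetical values).
classical = [0.9, 0.1, 0.5, 0.7]
bicriterion = [0.8, 0.2, 0.45, 0.75]
order, c_top, b_top, agreement = compare_top_k(classical, bicriterion, k=2)
print(order, c_top, b_top, agreement)
```

Averaging `c_top` and `b_top` over repeated runs would give curves
of the kind plotted in Fig. 6.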

We have conducted an initial study of the nature of the
solution obtained for varying λ. In the bi-criterion
formulation, the value of λ decides which criterion is
weighted more. In Fig. 7, we plot the number of support
vectors for a wide range of λ values. The maximum number
of support vectors is obtained for λ = 0, and we normalize
all counts by this maximum. We observed that the number of
support vectors changes sharply (about 7%) as λ increases
from 0, but once λ becomes large enough (greater than 0.5)
its influence on the outcome diminishes and the sparsity
of the solution is steady for the given data and parameter
settings. We therefore set λ = 1 for all of our
experiments.

Figure 7: Influence of the control parameter λ on the
performance of the bi-criterion SVMs algorithm. The
control parameter has very little influence on the
multi-criterion optimization outcome for λ > 0.5.



Figure 8: A cartoon that represents various levels of
sparseness that can be observed in the kernel matrix
K(x_i, x_j). Region-1 (R1) and Region-3 (R3) represent
extreme scenarios. When all entries of x and y are
very different from each other and unique, the resultant
similarity matrix is strictly diagonally dominant (R1).
With very similar x and y the K matrix will be
very dense with very similar diagonal and off-diagonal
elements (R3). Region-2 (R2) represents a case when
the diagonal and off-diagonal elements are distributed
over a certain range.

5.2 A Simulated Study: To test the robustness of the
bi-criterion formulation, we developed a common test
platform with a set of diverse test scenarios using
synthetic data. So far we have studied the influence of
the quadratic penalty function (Eqn. 4.5) and the control
parameter on the outcome. For problems of this nature, the
properties of the kernel matrix (K) in Eqn. 4.5 play an
important role. Implicit mappings into feature space based
on different similarity functions and data sets are bound
to induce different types of density structure in the
kernel matrix. The main optimization algorithm involves
quadratic programming, which learns on these kernel
matrices. Here we investigate the influence of varying
kernel density on model performance and outcome. When the
input data x_i and x_j are very different from each other
and unique, the resultant similarity matrix is strictly
diagonally dominant, i.e. |K(x_i, x_i)| > |K(x_i, x_j)|
for all j ≠ i. The other extreme is when all the entries x
and y are very similar in feature space and tightly
clustered, which results in a highly dense K matrix.
Fig. 8 illustrates these scenarios in cartoon form.
Region-1 (R1) and Region-3 (R3) represent the extremely
sparse and highly dense cases, respectively. Region-2 (R2)
represents a case where the diagonal and off-diagonal
elements are distributed over a certain range.
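
The regimes above can be illustrated with a small synthetic kernel.
The sketch below is not the authors' generator. It assumes that the
off-diagonal entries are drawn from a normal distribution and clipped
to [0, 1], it uses the element-wise condition stated above rather
than the row-sum definition of diagonal dominance, and it does not
guarantee a positive semi-definite matrix. The function names, seed,
and matrix size are hypothetical.

```python
import random


def synthetic_kernel(n, mu, sigma, seed=0):
    """Symmetric n x n matrix with unit diagonal; off-diagonal entries
    are normal draws clipped to [0, 1] (clipping is an assumption)."""
    rng = random.Random(seed)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        K[i][i] = 1.0
        for j in range(i + 1, n):
            value = min(1.0, max(0.0, rng.gauss(mu, sigma)))
            K[i][j] = K[j][i] = value
    return K


def histogram(K, bins=50):
    """Count the entries of K in equally spaced bins over [0, 1]."""
    counts = [0] * bins
    for row in K:
        for value in row:
            counts[min(int(value * bins), bins - 1)] += 1
    return counts


def diagonal_exceeds_offdiagonal(K):
    """Element-wise check: |K[i][i]| > |K[i][j]| for every j != i."""
    n = len(K)
    return all(
        abs(K[i][i]) > abs(K[i][j])
        for i in range(n)
        for j in range(n)
        if j != i
    )


sparse_K = synthetic_kernel(64, mu=0.05, sigma=0.02)  # Region-1-like
dense_K = synthetic_kernel(64, mu=0.99, sigma=0.02)   # Region-3-like
print(diagonal_exceeds_offdiagonal(sparse_K))
print(diagonal_exceeds_offdiagonal(dense_K))
print(histogram(sparse_K)[:5])
```

Sweeping `mu` from near 0 to near 1 moves the mass of the histogram
from one end of the value range to the other.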

We further illustrate the above scenarios of varying
sparseness using a synthetic data set. The data are
randomly generated from a normal distribution with
user-defined mean μ and standard deviation σ. The
resultant kernel matrices are symmetric and of size
2048 × 2048. We force the diagonal elements to unity, as
is the case for most similarity functions (e.g. the nLCS
function shown in Eqn. ??), which vary between 0 and 1,
where 1 represents the highest similarity or an exact
match. For each combination of μ and σ, we binned the
elements of K into 50 equally spaced groups, each of which
represents a “value range” between 0 and 1, and counted
the number of elements in each group. In the analysis, we
considered a total of 20 different cases in which the
density distribution of the matrix moves from one end of
the “value range” to the other. Figure 9 presents some of
these examples. Fig. 9(a) shows an example where all the
elements of the kernel matrix other than the diagonal are
extremely small. This is a typical example of a diagonally
dominant matrix. Fig. 9(c) is the other extreme, with very
comparable diagonal and off-diagonal elements. Fig. 9(b)
and Fig. 9(d) represent a simulated and a realistic
(aviation data) scenario in which the off-diagonal
elements of K take values from intermediate ranges. Under
each test condition, we ran 10 experiments with random
initializations and recorded the run time and number of
support vectors for the bi-criterion and classical SVMs
algorithms.

Figure 9: Histograms comparing the distributions of the elements of the kernel
K under different test cases. Each element of K represents the similarity between two entities. The diagonal
elements represent self-similarities. Subfigures (a)-(c) were obtained from simulated data while subfigure (d) was
obtained from real airline FOQA data consisting of 2048 flights with 149 continuous and discrete features.
In subfigure (a) the kernel K is a diagonally dominant matrix, while subfigure (c) represents the case where the
diagonal and off-diagonal elements of the matrix K are very comparable. Subfigure (b) represents an intermediate
scenario.

To measure the effectiveness of the model, we define the
following two performance metrics.

Definition: The degree of sparsity is a gain metric that
calculates the ratio of the number of non-zero Lagrange
multipliers from the bi-criterion model to that of the
classical SVMs model. This metric indicates how effective
the proposed model is in the testing phase. A high degree
of sparsity simply implies that the solution is obtained
with fewer support vectors.

Definition: The execution time gain is a run-time-related
gain metric that calculates the ratio of the run time of
the bi-criterion model to that of the classical SVMs
model. This metric shows how quickly the proposed
optimization converges to a solution compared to the
benchmark method.
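
The two metrics can be sketched as follows. The definitions above do
not fix a sign convention, so this sketch assumes one: each gain is
the base-10 logarithm of the classical value divided by the
bi-criterion value, so that a positive number favours the
bi-criterion model. The function names and example values are
hypothetical, apart from the mean run times quoted earlier (21.64 and
2.43 hours).

```python
import math


def log_gain(classical_value, bicriterion_value):
    """Assumed convention: log10(classical / bi-criterion), so a
    positive gain means the bi-criterion model did better."""
    return math.log10(classical_value / bicriterion_value)


def quadrant(sparsity_gain, time_gain):
    """Describe where a run falls in a sparsity/time quadrant plot."""
    sparsity = "sparser" if sparsity_gain > 0 else "not sparser"
    speed = "faster" if time_gain > 0 else "slower"
    return f"{sparsity}, {speed}"


# Support-vector counts are illustrative; run times are the reported means.
sparsity_gain = log_gain(classical_value=2048, bicriterion_value=308)
time_gain = log_gain(classical_value=21.64, bicriterion_value=2.43)
print(round(sparsity_gain, 3), round(time_gain, 3))
print(quadrant(sparsity_gain, time_gain))
```

Under this convention, a run in which the classical solver converges
first has a negative time gain.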

Figure 10 summarizes the results using the performance
metrics in a quadrant format for easier visual
understanding. The two right-hand quadrants always
indicate a sparser solution. Similarly, the upper two
quadrants indicate a speed-up by some factor. The +/+
quadrant is the most desirable operating region, where the
proposed solution is both sparser and faster to obtain. A
negative execution time gain means that the classical
solution converges faster. In Fig. 10, the execution time
gain is intentionally plotted in log scale to give better
resolution of the differences in the performance space. As
can be seen, the bi-criterion formulation outperforms the
benchmark algorithm considerably in most cases. In every
case, bi-criterion SVMs reported the least number of
support vectors (i.e. νN), whereas the solution of the
classical method changes depending on the density
structure of the kernel matrix.

For instance, for a diagonally dominant matrix (see
Fig. 8, Region-1, and Fig. 9(a)) classical SVMs report N
support vectors, equal to the number of training points.
This is because all the training examples are so different
and unique that each carries equal weight as a support
vector. However, the convergence time of the classical
approach varied considerably. Under this scenario, there
were several occasions where bi-criterion ran slower than
the classical SVMs by a factor of a few. The main gain is
instead in the test phase, where, to evaluate a single
test point, the classical model has to perform about 1/ν
times more operations (N rather than νN kernel
evaluations). On the other hand, for a highly dense kernel
matrix with very similar diagonal and off-diagonal
elements (see Fig. 8, Region-3, and Fig. 9(c)), the
classical method takes an extremely long time to converge,
but its solutions are very similar to those obtained by
the bi-criterion formulation. Under this scenario, the
gain is in the run time while building the model. In all
other cases the importance of the regularization term is
evident. These are the test cases (see Fig. 8, any
combination of Region-1, Region-2 and Region-3, and
Fig. 9(b) and (d)) where the proposed algorithm
outperforms the baseline in run time by several factors
and also yields a sparser solution. This is exactly what
one expects from an algorithm that learns much faster and
produces sparser solutions, so that the model can be used
to test large volumes of data in a short time. This sort
of scaling will be very attractive for high-dimensional
and dense data matrices, particularly when the detection
accuracy is well preserved.

Figure 10: Performance comparison between bi-criterion and classical SVMs methods. The execution time gain is
in log scale.
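
The test-phase saving can be made concrete with a short sketch. It
assumes the standard one-class SVM decision value, a weighted sum of
kernel evaluations over the support vectors minus an offset ρ, and
it uses an RBF kernel purely as a stand-in for the similarity
functions used in the paper. The names, the tolerance, and the toy
data are hypothetical.

```python
import math


def rbf(x, y, gamma=1.0):
    """Stand-in similarity function (not the kernel used in the paper)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def decision_value(x, train_points, alphas, rho, kernel=rbf, tol=1e-8):
    """Return sum_i alpha_i * k(x_i, x) - rho together with the number
    of kernel evaluations; zero-weight points are skipped."""
    total = 0.0
    evaluations = 0
    for point, alpha in zip(train_points, alphas):
        if alpha > tol:
            total += alpha * kernel(point, x)
            evaluations += 1
    return total - rho, evaluations


# Toy setting: N = 20 training points, nu = 0.15, so nu * N = 3.
points = [(float(i), 0.0) for i in range(20)]
dense_alphas = [1.0 / 20] * 20               # every point is a support vector
sparse_alphas = [1.0 / 3] * 3 + [0.0] * 17   # only nu * N support vectors

_, dense_cost = decision_value((0.5, 0.0), points, dense_alphas, rho=0.1)
_, sparse_cost = decision_value((0.5, 0.0), points, sparse_alphas, rho=0.1)
print(dense_cost, sparse_cost, dense_cost / sparse_cost)
```

The ratio of the two evaluation counts is N/(νN), which is where the
1/ν factor in the per-point test cost comes from.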

6 Conclusion 

In this paper we devised a version of one-class SVMs
with an addition to the objective function that leads to
sparser solutions. We demonstrated that these solutions
are nearly as accurate as the solutions from the classical
one-class SVM algorithm but are obtained in much less
time and/or can be used to classify new examples in
much less time. We also demonstrated that the reduced
number of support vectors and the resulting reduction
in running time are not sensitive to λ, the weight used
to control the tradeoff between the two terms in our
objective function. In combination with our earlier
development of MKAD [7], we are able to identify anomalies
in data from commercial aviation accurately and in a
practical amount of time without losing any of the
advantages of kernel methods, such as global optimality
for a given training set.

We plan to investigate further efficiency and scalability
improvements by developing distributed and online versions
of our algorithm. Because our algorithm only involves a
simple change to the objective function and does not
require any changes to the solver, we can use any other
solver used for SVMs. We plan to investigate how strong
our efficiency improvements remain when using other
solvers.